
Applications in Natural Language Processing

FIGURE 5.2

Fully quantized transformer.

estimates computed during training. For every forward pass, the xmin and xmax variables are updated via an exponential moving average with a momentum of 0.9.
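The running-range update described above can be sketched as follows; this is a minimal illustration, and the function and variable names are assumptions rather than the authors' code.

```python
MOMENTUM = 0.9  # momentum used for the exponential moving average


def update_range(x_min, x_max, batch):
    """Update the running min/max estimates with an exponential
    moving average over the current batch's observed range."""
    batch_min, batch_max = min(batch), max(batch)
    x_min = MOMENTUM * x_min + (1.0 - MOMENTUM) * batch_min
    x_max = MOMENTUM * x_max + (1.0 - MOMENTUM) * batch_max
    return x_min, x_max
```

With a momentum of 0.9, each forward pass nudges the stored range only 10% of the way toward the current batch's range, smoothing out batch-to-batch noise.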

During backpropagation, the straight-through estimator [37] is used to bypass the non-differentiable round function, and the gradients of clamped values are set to zero.
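The two backward-pass rules can be sketched as a pair of functions: the forward pass clamps and rounds, while the straight-through backward pass copies the incoming gradient through the round operation and zeroes it wherever the input was clamped. This is an illustrative sketch, not the authors' implementation; the function names and the uniform quantization scheme are assumptions.

```python
import numpy as np


def fake_quant_forward(x, x_min, x_max, bits=8):
    """Clamp x to [x_min, x_max], round to 2**bits - 1 uniform levels,
    then dequantize back to the original scale."""
    scale = (x_max - x_min) / (2 ** bits - 1)
    clamped = np.clip(x, x_min, x_max)
    q = np.round((clamped - x_min) / scale)
    return q * scale + x_min


def fake_quant_backward(x, x_min, x_max, grad_out):
    """Straight-through estimator: treat round() as the identity so the
    gradient passes through unchanged, except where the input fell
    outside [x_min, x_max], where the gradient is set to zero."""
    pass_through = (x >= x_min) & (x <= x_max)
    return grad_out * pass_through
```

Passing the gradient straight through the round function keeps training stable, while zeroing it for clamped values stops weights from drifting further outside the representable range.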

5.2.2

What to Quantize

They choose to quantize all operations, which can provide a computational speed gain at inference. The overview is presented in Fig. 5.2. In particular, they quantize all matrix multiplications, meaning that the inputs and weights of MatMuls will both be b-bit quantized. The model's divisions are also quantized as long as the numerator and denominator are